Introduction

I wrote this as a fun little unofficial project to practice my data science skills, looking at NBA scheduling patterns and their relationship to travel and rest. I used historical seasons and a draft portion of the 2024–25 schedule, focusing on density markers (e.g., 4 games in 6 nights), back-to-backs, rest-day distributions, cross-timezone movement, and travel distances. I also built an interactive schedule view as visualization practice, and ran a small machine learning exercise to quantify schedule-related effects on win probability.

Setup and Data

rm(list = ls())
options(scipen = 999)
library(tidyverse)
library(here)
library(lubridate)
library(slider)
options(warn = -1)
suppressPackageStartupMessages({
  library(dplyr)
  library(lme4)
  library(ggplot2)
  library(slider)
  library(scales)
  library(plotly)
  library(lubridate)
  library(ggpattern)
  library(cowplot)
})
# Note: you may need to adjust these paths to wherever the CSVs live.
# here() resolves them relative to the project root, so if the data sits in the project folder these calls should work as written.
schedule       <- read_csv(here("schedule.csv"))
draft_schedule <- read_csv(here("schedule_24_partial.csv"))
locations      <- read_csv(here("locations.csv"))
game_data      <- read_csv(here("team_game_data.csv"))

Schedule Analysis

Here I isolate OKC’s 80-game draft schedule and tag games that are the 4th within the past 6 nights (overlapping windows are allowed). The printed result gives the count, and a small table lists the flagged dates.

# keep OKC only from the 2024-25 season
okc_1 <- draft_schedule %>%
  filter(team == "OKC") %>%
  mutate(gamedate = as_date(gamedate)) %>%
  arrange(gamedate)

# Set a window of 6 days
okc_1 <- okc_1 %>%
  mutate(
    games_past_6_nights = slide_index_dbl(
      .x = gamedate,
      .i = gamedate,
      .f = ~ length(.x),
      .before = days(5),
      .complete = FALSE
    ),
    is_4th_in_6_nights = games_past_6_nights == 4 # See if the game is the 4th one
  )

# Result
count_4in6 <- sum(okc_1$is_4th_in_6_nights, na.rm = TRUE)
dates_4th <- okc_1 %>%
  filter(is_4th_in_6_nights) %>%
  select(gamedate, opponent, home, win)

count_4in6
## [1] 26

OKC’s draft schedule has 26 games flagged as the 4th in 6 nights.
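To make the trailing-window counting concrete, here is a tiny standalone sketch of the same slide_index_dbl() call on a handmade date vector (the dates are invented for illustration):

```r
library(slider)
library(lubridate)

# Five fictional game dates
dates <- as_date(c("2024-01-01", "2024-01-03", "2024-01-04",
                   "2024-01-06", "2024-01-08"))

# For each game, count games in the trailing 6-night window [date - 5, date]
games_past_6 <- slide_index_dbl(
  .x = dates, .i = dates,
  .f = ~ length(.x),
  .before = days(5)
)

games_past_6
# [1] 1 2 3 4 4   -> the Jan 6 and Jan 8 games are each the 4th in 6 nights
```

Note the windows overlap: the Jan 8 game counts the Jan 3, 4, and 6 games again, which is exactly the behavior the 4-in-6 tagging relies on.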

sch <- schedule %>%
  mutate(gamedate = as_date(gamedate)) %>%
  arrange(team, season, gamedate)

# Set the 6-day window of team x season
sch_tagged <- sch %>%
  group_by(team, season) %>%
  mutate(
    games_past_6_nights = slide_index_dbl(
      .x = gamedate,
      .i = gamedate,
      .f = ~ length(.x),
      .before = days(5),
      .complete = FALSE
    ),
    is_4th_in_6 = games_past_6_nights == 4
  ) %>%
  ungroup()

# Sum up
agg_team_season <- sch_tagged %>%
  group_by(team, season) %>%
  summarise(
    games_played   = n(),
    four_in_six    = sum(is_4th_in_6, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    # Unify to per 82 games
    four_in_six_per82 = four_in_six * (82 / games_played)
  )

overall_avg_per82 <- mean(agg_team_season$four_in_six_per82, na.rm = TRUE)
overall_avg_per82
## [1] 25.09998

25.1 flagged 4-in-6 games per 82, averaged across all team-seasons.
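The per-82 normalization is a simple rescaling so shortened seasons are comparable. A quick sketch with invented numbers, e.g. 20 flagged games in a 66-game season:

```r
# Rescale a count from games_played to an 82-game pace
four_in_six  <- 20
games_played <- 66
per82 <- four_in_six * (82 / games_played)
per82
# [1] 24.84848
```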

team_avg <- agg_team_season %>%
  group_by(team) %>%
  summarise(
    avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_4in6_per82))

top_team    <- team_avg %>% slice(1)
bottom_team <- team_avg %>% slice(n())

paste(top_team$team, round(top_team$avg_4in6_per82, 1))
## [1] "CHA 28.1"
paste(bottom_team$team, round(bottom_team$avg_4in6_per82, 1))
## [1] "NYK 22.2"

  • Most 4-in-6 stretches on average: CHA (28.1)
  • Fewest 4-in-6 stretches on average: NYK (22.2)

Analyzing the gap between the teams with the most and fewest 4-in-6 stretches

mean_val <- mean(team_avg$avg_4in6_per82, na.rm = TRUE)
std_val  <- sd(team_avg$avg_4in6_per82, na.rm = TRUE)

# Get CHA and NYK values
cha_val <- team_avg %>% filter(team == "CHA") %>% pull(avg_4in6_per82)
nyk_val <- team_avg %>% filter(team == "NYK") %>% pull(avg_4in6_per82)

# Calculate Z-scores
cha_z <- (cha_val - mean_val) / std_val
nyk_z <- (nyk_val - mean_val) / std_val

# Output results
list(
  overall_avg_per82 = mean_val,
  std_val           = std_val,
  cha_z             = cha_z,
  nyk_z             = nyk_z
)
## $overall_avg_per82
## [1] 25.09998
## 
## $std_val
## [1] 1.564988
## 
## $cha_z
## [1] 1.922834
## 
## $nyk_z
## [1] -1.861908

The gap is noticeable but not shocking. The difference between Charlotte (28.1) and New York (22.2) corresponds to roughly two standard deviations on either side of the league average (25.1, SD = 1.56): Charlotte’s z-score is +1.92 and New York’s is −1.86, so both sit near the edge of what would usually be considered unusual without being extreme outliers. Beyond random variation, schedule-design factors may help explain the gap: travel distance, divisional matchups, and arena availability all influence how games cluster. For instance, New York’s location in a geographically dense region reduces travel and allows a more evenly spread schedule, while Charlotte, as a smaller-market team, may face more compressed stretches such as back-to-backs or 4-in-6 clusters.

Analyzing BKN’s defensive eFG% in the 2023–24 season, and specifically games where their opponent was on the second night of a back-to-back

schedule  <- schedule  %>% mutate(gamedate = as_date(gamedate))
game_data <- game_data %>% mutate(gamedate = as_date(gamedate))

# Mark back to back
schedule_b2b <- schedule %>%
  arrange(team, gamedate) %>%
  group_by(team) %>%
  mutate(prev_game = lag(gamedate),
         days_rest = as.numeric(gamedate - prev_game),
         b2b_flag  = ifelse(days_rest == 1, "second", "other")) %>%
  ungroup()

game_data_b2b <- game_data %>%
  left_join(
    schedule_b2b %>% select(gamedate, team, opp_b2b = b2b_flag),
    by = c("gamedate" = "gamedate", "off_team" = "team")
  )

# Filter out BKN as def
bkn_def <- game_data_b2b %>%
  filter(season == 2023, def_team == "BKN")

# Calculate overall eFG%
bkn_def_overall <- bkn_def %>%
  summarise(
    total_fgm  = sum(fg2made + fg3made, na.rm = TRUE),
    total_fg3m = sum(fg3made,            na.rm = TRUE),
    total_fga  = sum(fg2attempted  + fg3attempted,   na.rm = TRUE),
    def_eFG    = (total_fgm + 0.5 * total_fg3m) / total_fga
  )

# eFG% when opp is back to back
bkn_def_b2b <- bkn_def %>%
  filter(opp_b2b == "second") %>%
  summarise(
    total_fgm_b2b  = sum(fg2made + fg3made, na.rm = TRUE),
    total_fg3m_b2b = sum(fg3made,            na.rm = TRUE),
    total_fga_b2b  = sum(fg2attempted  + fg3attempted,   na.rm = TRUE),
    def_eFG_b2b    = (total_fgm_b2b + 0.5 * total_fg3m_b2b) / total_fga_b2b
  )

# result
result <- tibble(
  team = "BKN",
  season = "2023-24",
  def_eFG_overall = bkn_def_overall$def_eFG,
  def_eFG_opp_B2B = bkn_def_b2b$def_eFG_b2b
)
result
## # A tibble: 1 × 4
##   team  season  def_eFG_overall def_eFG_opp_B2B
##   <chr> <chr>             <dbl>           <dbl>
## 1 BKN   2023-24           0.543           0.535
  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%
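For reference, eFG% weights made threes an extra 50% relative to twos: eFG% = (FGM + 0.5 × 3PM) / FGA, which is the same formula the summarise() calls above apply to the season totals. A minimal helper with made-up shooting numbers:

```r
# eFG% = (all field goals made + 0.5 * threes made) / field goals attempted
efg <- function(fgm, fg3m, fga) (fgm + 0.5 * fg3m) / fga

# Hypothetical line: 40 FGM (12 of them threes) on 85 FGA
efg(40, 12, 85)
# [1] 0.5411765   (~54.1%)
```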

Modeling: A Simple Win-Probability Lens

Below is a compact mixed-effects logistic model as a first-pass way to quantify schedule-related effects while controlling for opponent strength proxies.

library(readr)

# Safe division function to avoid dividing by zero or NA
safe_div <- function(num, den) ifelse(is.na(den) | den == 0, NA_real_, num / den)

# -------------------------
# (1) Row-level (per game) metrics
# -------------------------
game_data_metrics <- game_data %>%
  mutate(
    # Offensive metrics (for off_team)
    PPA       = safe_div(shotattemptpoints, shotattempts),     # ~2*TS%
    AST_pct   = safe_div(assists, fgmade),
    OREB_pct  = safe_div(reboffensive, reboundchance),
    DREB_pct  = safe_div(rebdefensive, reboundchance),
    TOV_pct   = safe_div(turnovers, shotattempts + turnovers),
    ORTG      = safe_div(points, possessions / 100),

    # Defensive metrics (for def_team; derived from opponent's offense)
    STL_rate_def   = safe_div(stealsagainst, possessions),
    STL_per100_def = 100 * STL_rate_def,
    BLK_pct_def    = safe_div(blocksagainst, fg2attempted),
    DRTG_def       = safe_div(points, possessions / 100)
  )

# -------------------------
# (2) Season-to-date (pre-game) cumulative metrics
# -------------------------

# (2a) Offensive S2D (by off_team)
off_cum_before <- game_data %>%
  arrange(season, off_team, gamedate, nbagameid) %>%
  group_by(season, team = off_team) %>%
  mutate(
    c_poss_off      = cumsum(coalesce(possessions, 0)),
    c_pts_off       = cumsum(coalesce(points, 0)),
    c_shotpts       = cumsum(coalesce(shotattemptpoints, 0)),
    c_shota         = cumsum(coalesce(shotattempts, 0)),
    c_ast           = cumsum(coalesce(assists, 0)),
    c_fgmade        = cumsum(coalesce(fgmade, 0)),
    c_oreb          = cumsum(coalesce(reboffensive, 0)),
    c_rebchance_off = cumsum(coalesce(reboundchance, 0)),
    c_tov           = cumsum(coalesce(turnovers, 0))
  ) %>%
  mutate(
    poss_off_before      = lag(c_poss_off),
    pts_off_before       = lag(c_pts_off),
    shotpts_before       = lag(c_shotpts),
    shota_before         = lag(c_shota),
    ast_before           = lag(c_ast),
    fgmade_before        = lag(c_fgmade),
    oreb_before          = lag(c_oreb),
    rebchance_off_before = lag(c_rebchance_off),
    tov_before           = lag(c_tov)
  ) %>%
  mutate(
    off_ORTG_before      = safe_div(pts_off_before, poss_off_before / 100),
    off_PPA_before       = safe_div(shotpts_before, shota_before),
    off_AST_pct_before   = safe_div(ast_before, fgmade_before),
    off_OREB_pct_before  = safe_div(oreb_before, rebchance_off_before),
    off_TOV_pct_before   = safe_div(tov_before, shota_before + tov_before)
  ) %>%
  ungroup() %>%
  select(season, nbagameid, team,
         off_ORTG_before, off_PPA_before, off_AST_pct_before,
         off_OREB_pct_before, off_TOV_pct_before)
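The cumsum() + lag() pairing above is what makes these season-to-date metrics strictly pre-game: the cumulative total includes the current row, and lagging it by one row shifts everything to “totals entering this game.” A toy illustration (values invented):

```r
library(dplyr)

pts <- c(10, 20, 30)               # points scored in games 1-3
running <- cumsum(pts)             # totals through each game: 10 30 60
pts_before <- dplyr::lag(running)  # totals entering each game: NA 10 30
pts_before
# [1] NA 10 30
```

The NA in game 1 is intentional: there is no pre-game information before the season opener, and safe_div() propagates it rather than inventing a value.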

# (2b) Defensive S2D (by def_team)
def_cum_before <- game_data %>%
  arrange(season, def_team, gamedate, nbagameid) %>%
  group_by(season, team = def_team) %>%
  mutate(
    c_poss_def    = cumsum(coalesce(possessions, 0)),     # defensive poss = opponent poss
    c_pts_allowed = cumsum(coalesce(points, 0)),
    c_steals_for  = cumsum(coalesce(stealsagainst, 0)),
    c_blocks_for  = cumsum(coalesce(blocksagainst, 0)),
    c_opp_fg2a    = cumsum(coalesce(fg2attempted, 0))
  ) %>%
  mutate(
    poss_def_before    = lag(c_poss_def),
    pts_allowed_before = lag(c_pts_allowed),
    steals_before      = lag(c_steals_for),
    blocks_before      = lag(c_blocks_for),
    opp_fg2a_before    = lag(c_opp_fg2a)
  ) %>%
  mutate(
    def_DRTG_before       = safe_div(pts_allowed_before, poss_def_before / 100),
    def_STL_rate_before   = safe_div(steals_before, poss_def_before),
    def_STL_per100_before = 100 * def_STL_rate_before,
    def_BLK_pct_before    = safe_div(blocks_before, opp_fg2a_before)
  ) %>%
  ungroup() %>%
  select(season, nbagameid, team,
         def_DRTG_before, def_STL_rate_before, def_STL_per100_before, def_BLK_pct_before)

# (2c) Merge S2D back to each row
game_with_s2d <- game_data_metrics %>%
  # focal OFF S2D
  left_join(
    off_cum_before,
    by = c("season", "nbagameid", "off_team" = "team")
  ) %>%
  # opponent DEF S2D
  left_join(
    def_cum_before %>%
      rename(
        opp_def_DRTG_before        = def_DRTG_before,
        opp_def_STL_rate_before    = def_STL_rate_before,
        opp_def_STL_per100_before  = def_STL_per100_before,
        opp_def_BLK_pct_before     = def_BLK_pct_before
      ),
    by = c("season", "nbagameid", "def_team" = "team")
  ) %>%
  # opponent OFF S2D
  left_join(
    off_cum_before %>%
      rename(
        opp_off_team               = team,
        opp_off_ORTG_before        = off_ORTG_before,
        opp_off_PPA_before         = off_PPA_before,
        opp_off_AST_pct_before     = off_AST_pct_before,
        opp_off_OREB_pct_before    = off_OREB_pct_before,
        opp_off_TOV_pct_before     = off_TOV_pct_before
      ),
    by = c("season", "nbagameid", "def_team" = "opp_off_team")
  ) %>%
  # focal pre-game NET rating (OFF S2D - opponent DEF S2D)
  mutate(
    focal_NET_RTG_before = off_ORTG_before - opp_def_DRTG_before
  )
# -------------------------
# (3) Aggregate team × season metrics
# -------------------------

# Offensive aggregates
off_agg <- game_data %>%
  group_by(season, team = off_team) %>%
  summarise(
    poss_off      = sum(possessions, na.rm = TRUE),
    pts_off       = sum(points, na.rm = TRUE),
    shotpts_sum   = sum(shotattemptpoints, na.rm = TRUE),
    shota_sum     = sum(shotattempts, na.rm = TRUE),
    ast_sum       = sum(assists, na.rm = TRUE),
    fgmade_sum    = sum(fgmade, na.rm = TRUE),
    oreb_sum      = sum(reboffensive, na.rm = TRUE),
    rebchance_off = sum(reboundchance, na.rm = TRUE),
    tov_sum       = sum(turnovers, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    ORTG     = safe_div(pts_off, poss_off / 100),
    PPA      = safe_div(shotpts_sum, shota_sum),
    AST_pct  = safe_div(ast_sum, fgmade_sum),
    OREB_pct = safe_div(oreb_sum, rebchance_off),
    TOV_pct  = safe_div(tov_sum, shota_sum + tov_sum)
  ) %>%
  select(season, team, ORTG, PPA, AST_pct, OREB_pct, TOV_pct)

# Defensive aggregates
def_agg <- game_data %>%
  group_by(season, team = def_team) %>%
  summarise(
    poss_def     = sum(possessions, na.rm = TRUE),
    pts_allowed  = sum(points, na.rm = TRUE),
    steals_for   = sum(stealsagainst, na.rm = TRUE),
    blocks_for   = sum(blocksagainst, na.rm = TRUE),
    opp_fg2a_sum = sum(fg2attempted, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    DRTG       = safe_div(pts_allowed, poss_def / 100),
    STL_rate   = safe_div(steals_for, poss_def),
    STL_per100 = 100 * STL_rate,
    BLK_pct    = safe_div(blocks_for, opp_fg2a_sum)
  ) %>%
  select(season, team, DRTG, STL_rate, STL_per100, BLK_pct)

# Combine and compute Net Rating
team_season_metrics <- off_agg %>%
  full_join(def_agg, by = c("season", "team")) %>%
  mutate(NET_RTG = ORTG - DRTG) %>%
  arrange(season, team)

# Schedule-related data processing

# Merge the four schedule-related tables built in earlier sections
# (sch_rest: rest days; travel_data: travel distances;
#  schedule_b2b: back-to-back flags; sch_tz: time-zone changes)
schedule_model <- sch_rest %>%
  left_join(
    travel_data %>% 
      select(season, gamedate, team, opponent, home, win, travel_km),
    by = c("season","gamedate","team","opponent","home","win")
  ) %>%
  left_join(
    schedule_b2b %>% 
      select(season, gamedate, team, opponent, home, win, b2b_flag),
    by = c("season","gamedate","team","opponent","home","win")
  ) %>%
  left_join(
    sch_tz %>% 
      select(season, gamedate, team, opponent, home, win, cross_tz),
    by = c("season","gamedate","team","opponent","home","win")
  )

schedule_model <- schedule_model %>%
  mutate(gamedate = as.Date(gamedate))

schedule_model <- schedule_model %>%
  arrange(team, season, gamedate) %>%
  group_by(team, season) %>%
  mutate(
    games_in_4 = slide_int(
      gamedate, 
      ~ sum(.x >= .x[length(.x)] - 3 & .x <= .x[length(.x)]),
      .before = Inf, 
      .complete = TRUE
    ),
    games_in_6 = slide_int(
      gamedate, 
      ~ sum(.x >= .x[length(.x)] - 5 & .x <= .x[length(.x)]),
      .before = Inf, 
      .complete = TRUE
    )
  ) %>%
  ungroup()


# Merge

s2d_model <- game_with_s2d %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  select(
    season, gamedate, off_team, def_team,
    # focal team's S2D (pre-game)
    off_ORTG_before, off_PPA_before, off_AST_pct_before,
    off_OREB_pct_before, off_TOV_pct_before,
    # opponent defense S2D (already renamed in previous step)
    opp_def_DRTG_before, opp_def_STL_rate_before,
    opp_def_STL_per100_before, opp_def_BLK_pct_before,
    # opponent offense S2D
    opp_off_ORTG_before, opp_off_PPA_before,
    opp_off_AST_pct_before, opp_off_OREB_pct_before, opp_off_TOV_pct_before,
    # optional composite
    focal_NET_RTG_before
  )

model_data <- schedule_model %>%
  mutate(
    gamedate = as.Date(gamedate),
    time_zone_change = cross_tz,
    b2b = ifelse(is.na(b2b_flag), "other", b2b_flag)  # fill NAs; recoded to 0/1 below
  ) %>%
  left_join(
    s2d_model,
    by = c("season", "gamedate", "team" = "off_team", "opponent" = "def_team")
  )

anti <- schedule_model %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  anti_join(
    s2d_model, by = c("season", "gamedate", "team" = "off_team", "opponent" = "def_team")
  )
cat("Unmatched rows after join:", nrow(anti), "\n")
## Unmatched rows after join: 0
# Model
model_data <- model_data %>%
  filter(season >= 2019 & season <= 2023)
model_data <- model_data %>%
  mutate(b2b = ifelse(b2b == "second", 1, 0))

library(lme4)

model_main <- glmer(
  win ~ home + b2b + scale(days_rest) + scale(travel_km) + time_zone_change +
        opp_off_ORTG_before + opp_def_DRTG_before +
        factor(season) + (1 | team),
  data = model_data,
  family = binomial(link = "logit")
)

library(broom.mixed)

# Fixed effects table only
tidy(model_main, effects = "fixed", conf.int = TRUE) %>%
  select(term, estimate, std.error, conf.low, conf.high, p.value)
## # A tibble: 12 × 6
##    term                estimate std.error conf.low conf.high  p.value
##    <chr>                  <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
##  1 (Intercept)          0.102     0.771    -1.41     1.61    8.95e- 1
##  2 home                 0.431     0.0413    0.350    0.513   1.69e-25
##  3 b2b                 -0.255     0.0526   -0.358   -0.152   1.24e- 6
##  4 scale(days_rest)     0.0114    0.0193   -0.0264   0.0492  5.53e- 1
##  5 scale(travel_km)    -0.0396    0.0209   -0.0806   0.00140 5.84e- 2
##  6 time_zone_change    -0.0295    0.0436   -0.115    0.0560  4.99e- 1
##  7 opp_off_ORTG_before -0.0812    0.00502  -0.0910  -0.0713  6.85e-59
##  8 opp_def_DRTG_before  0.0789    0.00528   0.0685   0.0892  1.58e-50
##  9 factor(season)2020   0.0110    0.0672   -0.121    0.143   8.70e- 1
## 10 factor(season)2021   0.00367   0.0631   -0.120    0.127   9.54e- 1
## 11 factor(season)2022   0.0116    0.0703   -0.126    0.149   8.69e- 1
## 12 factor(season)2023   0.0155    0.0757   -0.133    0.164   8.38e- 1
# Random effects variance
tidy(model_main, effects = "ran_pars")
## # A tibble: 1 × 4
##   effect   group term            estimate
##   <chr>    <chr> <chr>              <dbl>
## 1 ran_pars team  sd__(Intercept)    0.412
# Robustness: XGBoost classification

library(Matrix)
library(xgboost)
library(pROC)
x_formula <- ~ 
  home + b2b + scale(days_rest) + scale(travel_km) + time_zone_change +
  opp_off_ORTG_before + opp_def_DRTG_before +
  opp_off_PPA_before + opp_off_AST_pct_before + 
  opp_off_OREB_pct_before + opp_off_TOV_pct_before +
  opp_def_STL_rate_before + opp_def_BLK_pct_before +
  factor(season) + factor(team) - 1

# Model matrix
X <- model.matrix(x_formula, data = model_data)
X[is.na(X)] <- 0
X <- Matrix(X, sparse = TRUE)

y <- model_data$win

# Train / test split (80/20); no seed is set, so the exact split varies run to run
n <- nrow(X)
idx_train <- sample.int(n, floor(0.8 * n))
idx_test  <- setdiff(seq_len(n), idx_train)

dtrain <- xgb.DMatrix(X[idx_train, ], label = y[idx_train])
dtest  <- xgb.DMatrix(X[idx_test,  ], label = y[idx_test])

# Cross-validated search for best nrounds with early stopping
params <- list(
  objective = "binary:logistic",
  eval_metric = "logloss",
  max_depth = 6,
  eta = 0.08,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_child_weight = 5,
  lambda = 1
)

cv <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 2000,
  nfold = 5,
  stratified = TRUE,
  early_stopping_rounds = 50,
  verbose = 0
)

best_nrounds <- cv$best_iteration
cat("Best CV nrounds:", best_nrounds, "\n")
## Best CV nrounds: 32
# Train final model with best_nrounds
watchlist <- list(train = dtrain, test = dtest)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds,
  watchlist = watchlist,
  verbose = 0
)

# Simple diagnostics on test set
pred_prob <- predict(xgb_model, dtest)
roc_obj <- pROC::roc(y[idx_test], pred_prob, quiet = TRUE)
auc_val <- pROC::auc(roc_obj)
logloss <- -mean(y[idx_test] * log(pmax(pred_prob, 1e-15)) +
                 (1 - y[idx_test]) * log(pmax(1 - pred_prob, 1e-15)))
acc <- mean( (pred_prob >= 0.5) == as.logical(y[idx_test]) )

cat(sprintf("Test AUC = %.3f | Logloss = %.4f | Accuracy = %.3f\n",
            as.numeric(auc_val), logloss, acc))
## Test AUC = 0.567 | Logloss = 0.6849 | Accuracy = 0.544
# Feature importance (gain-based)
imp <- xgb.importance(model = xgb_model)
print(head(imp, 20))
##                     Feature       Gain       Cover  Frequency
##                      <char>      <num>       <num>      <num>
##  1:  opp_def_BLK_pct_before 0.09208077 0.109934475 0.10253456
##  2: opp_def_STL_rate_before 0.08273674 0.035058794 0.10368664
##  3:     opp_def_DRTG_before 0.07796805 0.089150418 0.08525346
##  4:     opp_off_ORTG_before 0.07658510 0.041327081 0.08755760
##  5:  opp_off_TOV_pct_before 0.06801714 0.056201328 0.07834101
##  6:         factor(team)DET 0.06755210 0.094276116 0.02304147
##  7: opp_off_OREB_pct_before 0.06572309 0.046530466 0.07949309
##  8:      opp_off_PPA_before 0.06274678 0.030405600 0.07603687
##  9:  opp_off_AST_pct_before 0.05636446 0.030881886 0.07603687
## 10:        scale(travel_km) 0.04246523 0.024654053 0.07027650
## 11:      factor(season)2019 0.03616496 0.003091843 0.02188940
## 12:         factor(team)DEN 0.02757436 0.054210137 0.01267281
## 13:         factor(team)HOU 0.02666309 0.049261366 0.01152074
## 14:         factor(team)MIL 0.02445965 0.045026194 0.01036866
## 15:         factor(team)SAC 0.02139142 0.058457442 0.01497696
## 16:      factor(season)2020 0.02120323 0.002580288 0.01958525
## 17:         factor(team)BOS 0.01841763 0.044471013 0.01152074
## 18:         factor(team)PHX 0.01706500 0.038889010 0.01036866
## 19:         factor(team)CHA 0.01690314 0.031449264 0.01036866
## 20:      factor(season)2021 0.01461757 0.004378490 0.01497696
##                     Feature       Gain       Cover  Frequency
# Plot
xgb.plot.importance(imp[1:min(20, nrow(imp)), ])

library(lme4)

sc_days <- attr(scale(model_data$days_rest), "scaled:center")
sc_travel <- attr(scale(model_data$travel_km), "scaled:center")

model_data <- model_data %>%
  mutate(
    # actual
    p_actual = predict(model_main, newdata = ., type = "response", re.form = NA),

    # counterfactual
    p_cf = predict(
      model_main,
      newdata = mutate(., 
                       home = 0,
                       b2b = 0,
                       days_rest = sc_days,
                       travel_km = sc_travel,
                       time_zone_change = 0),
      type = "response",
      re.form = NA
    ),

    schedule_effect = p_actual - p_cf
  )

team_schedule_wins <- model_data %>%
  group_by(team) %>%
  summarise(schedule_wins = sum(schedule_effect, na.rm = TRUE)) %>%
  arrange(desc(schedule_wins))

most_helped <- team_schedule_wins %>% slice_max(schedule_wins, n = 1)
most_hurt   <- team_schedule_wins %>% slice_min(schedule_wins, n = 1)

cat(sprintf(
  "Most Helped by Schedule: %s (%.1f wins)\nMost Hurt by Schedule: %s (%.1f wins)\n",
  most_helped$team, most_helped$schedule_wins,
  most_hurt$team,   most_hurt$schedule_wins
))
## Most Helped by Schedule: WAS (15.1 wins)
## Most Hurt by Schedule: POR (12.4 wins)
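Each game’s schedule_effect above is just a difference of two inverse-logit predictions, summed over the season. A minimal sketch with invented coefficients (not the fitted values):

```r
# plogis() is the inverse logit: p = 1 / (1 + exp(-eta))
eta_actual <- 0 + 0.43   # hypothetical linear predictor incl. a home-court bump
eta_cf     <- 0          # counterfactual: neutral site, average rest/travel

effect <- plogis(eta_actual) - plogis(eta_cf)
effect
# ~0.106: this one game contributes about a tenth of a win to the season sum
```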

Model explanation:

Game outcomes are modeled with a mixed-effects logistic regression, with win probability as the dependent variable. The model includes the key schedule covariates (back-to-back status, days of rest, travel distance, and time-zone changes) and opponent strength (season-to-date offensive and defensive ratings), plus season fixed effects and team random intercepts. I excluded cumulative opponent metrics such as assist rate, turnover rate, and block rate because they were highly collinear with the core offensive/defensive ratings and made the model unstable; this choice trades some detail for interpretability and convergence while retaining the most meaningful measures of schedule burden and opponent quality. As a robustness check, an XGBoost model produced broadly similar variable importance rankings, again highlighting opponent defense and travel distance. Diagnostics (convergence, feature importance, and residual patterns) indicate stable estimates, supporting the reliability of the findings.
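Because the model is on the logit scale, the coefficients are easiest to read as odds ratios. Using the fitted estimates from the table above (home = 0.431, b2b = −0.255):

```r
# exp(coefficient) gives the multiplicative change in win odds
or_home <- exp(0.431)    # home court multiplies win odds by ~1.54
or_b2b  <- exp(-0.255)   # second night of a b2b multiplies win odds by ~0.78
c(or_home, or_b2b)
# [1] 1.5388 0.7749
```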

  • Most Helped by Schedule: WAS (15.1 wins)
  • Most Hurt by Schedule: POR (12.4 wins)

Disclaimer

This project is not meant to be rigorous, and there may be mistakes here and there; it’s just for fun! I wanted to see what I could do by applying my data science skills to something I’m interested in, and I had a lot of fun making it :)